Data Perturbation Independent Diagnosis and Validation of Breast Cancer Subtypes Using Clustering and Patterns

Molecular stratification of disease based on expression levels of sets of genes can help guide therapeutic decisions if such classifications can be shown to be stable against variations in sample source and data perturbation. Classifications inferred from one set of samples in one lab should be able to consistently stratify a different set of samples in another lab. We present a method for assessing such stability and apply it to the breast cancer (BCA) datasets of Sorlie et al. 2003 and Ma et al. 2003. We find that within the now commonly accepted BCA categories identified by Sorlie et al. Luminal A and Basal are robust, but Luminal B and ERBB2+ are not. In particular, 36% of the samples identified as Luminal B and 55% identified as ERBB2+ cannot be assigned an accurate category because the classification is sensitive to data perturbation. We identify a “core cluster” of samples for each category, and from these we determine “patterns” of gene expression that distinguish the core clusters from each other. We find that the best markers for Luminal A and Basal are (ESR1, LIV1, GATA-3) and (CCNE1, LAD1, KRT5), respectively. Pathways enriched in the patterns regulate apoptosis, tissue remodeling and the immune response. We use a different dataset (Ma et al. 2003) to test the accuracy with which samples can be allocated to the four disease subtypes. We find, as expected, that the classification of samples identified as Luminal A and Basal is robust but classification into the other two subtypes is not.


Introduction
Breast cancer (BCA) is a common and heterogeneous disease affecting women of all ages. Its occurrence is correlated with levels of estrogen (ER), progesterone (PR) and Her2neu (ERBB2) (Gruvberger et al. 2001;Lacroix and Leclercq 2005). Clinically, BCA is classifi ed into two major subtypes: ER+ and ER-. These groups are sometimes stratifi ed further by ERBB2 and/or PR levels. Across all treatments, ER+ and/or PR+ patients have a better prognosis than ER-and/or PR-tumors (Anim et al. 2005) and are also more likely to respond to hormone therapy (e.g. tamoxifen). Over-expression of ERBB2, seen in 25-30% of cases, is often a marker of aggressive disease, poor prognosis and mixed treatment results (Diermeier et al. 2005).
In spite of sustained research and medical and pharmaceutical effort, the incidence and death rate of BCA remains high. In 2005, more than 1.2 million new cases were diagnosed world wide and more than 20% of these will die from the disease (http://imaginis.com/breasthealth/). A major cause of treatment failure is that tumors with similar histopathology have divergent clinical courses and prognoses. The goal of the present study is the same as that of many others (Bieche et al. 1995;West et al. 2001;van't Veer et al. 2002;Honig et al. 2004;Ahnstrom et al. 2005;Sharma et al. 2005;Osipo et al. 2005), that molecular profi ling of BCA will clarify molecular correlates of disease, and this in turn will improve choice of therapy, and provide leads to new and more effective therapeutics.
In a series of papers on analysis of cDNA data of BCA tissue samples (Sorlie et al. 2001;Perou et al. 2000Perou et al. , 2001 the samples were uniquely assigned to one of four distinct categories: Luminal A, Luminal B, ERBB2+ (or Her2+) and Basal-like. These subtypes were later validated by Sotiriou et al. 2003, Loi et al. 2005and Kristensen et al. 2005. The fi rst two categories were mostly ER+ and the latter two mostly ER-negative. In the original analysis of Perou et al. 2000, Basal tumors were characterized by high levels of keratins 5 and 17, laminin, and fatty acid binding protein 7 genes (see also Charafe- Bhanot et al Jauffret et al. 2005), whereas ERBB2+ was characterized by high levels of several genes in the ERBB2 amplicon at 17q12.21 including ERBB2 and GRB7. Other studies identified different markers (Abd El-Rehim et al. 2005;Bertucci et al. 2005;Farmer et al. 2005;Hu et al. 2006;Sorlie et al. 2006) and a consensus set of markers for all BCA patients is not currently available.
Luminal and Basal-like tumors arise in distinct breast tissue cell types (Perou et al. 2000) and have very different disease course (Sorlie et al. 2001(Sorlie et al. , 2003 and response to therapeutics (Troester et al. 2004;Bertucci et al. 2005). The Luminal A subtype has the best overall prognosis followed by Luminal B while the other two subtypes are more aggressive and diffi cult to treat. The nomenclature of these subtypes has found its way into the language and culture of clinical practice and affects treatment options offered to patients. This makes it important to validate the stability of the original classifi cation of Sorlie et al. This is the main goal of the present paper.
The original analysis used simple hierarchical clustering (Eisen et al. 1998) which is known to be sensitive to data perturbation (Monti et al. 2003;van der Kloot et al. 2005). We re-analyzed the data using a robust averaging procedure to access the stability of imposing fi ve clusters (4 disease subtypes + Normal) on the data. The goal was to identify a "core" set of samples in each subtype which were stable under data perturbations, and to use these cores to determine "patterns" of gene expression for each core. We found stable core clusters for samples in the Luminal A, Basal and Normal clusters of the original analysis. However, the "Luminal B" and "ERBB2+" clusters of Sorlie et al. were unstable, with only a subset of the samples from the previous assignment remaining in stable core clusters under data perturbation. Instead, the originally assigned samples scattered over two or more clusters. This suggests that the Luminal B and ERBB2+ clusters (and their markers) as identifi ed in Sorlie et al. 2003, are unstable to data perturbation and need further analysis.
For the Luminal A and Basal categories, we fi nd a robust set of gene markers and patterns. If we combine the Sorlie et al. dataset with a new dataset from Ma et al. and cluster the combined data using these robust gene markers and patterns, then in the new data, we can assign a robust subtype label for Luminal A and Basal but not for the other two disease phenotypes.

Datasets
Data 1: The cDNA dataset of (Sorlie et al. 2003) was obtained from http://genome-www.stanford. edu/breast_cancer/robustness/data/SupplText. html. The data had expression levels of N = 552 genes for M = 122 samples of which 112 were from BCA patients and 10 controls. The 552 genes were selected by Sorlie et al. to have small variation in tissue samples from the same patient and a high variation in tissue samples from different patients. Data 2: The Ma et al. dataset was downloaded from www.geneexpression_ma.org. It consisted of expression levels of 1940 genes for 93 samples micro-dissected from 36 BCA patients and 3 normals. The samples were from three stages of disease: atypical ductal hyperplasia or ADH, ductal carcinoma in situ or DCIS and invasive ductal carcinoma or IDC respectively. The genes made available in the data were chosen by linear discriminant analysis as markers for breast cancer progression. ER, PR and HER2neu levels measured through immunohistochemistry were available.

Preprocessing and Imputation for Data 1
The matrix of samples (columns) and genes (rows) was normalized to mean 0 and variance 1 fi rst across columns and then across rows, ignoring missing entries. The matrix had 5,027 missing entries. We fi rst eliminated genes and samples with more than 20% missing entries. This reduced the data to N = 530 genes and M = 118 samples. We imputed the missing entries using a simple generalization of the kNN method of Troyanskaya et al. 2001 as follows: We identifi ed the k nearest neighbor entries for missing entry x i j using the Euclidean metric, 1 2 = -l l ! with the requirement that the genes chosen as nearest neighbors have at least t% fi lled entries. Twenty imputations were done at each x i j using the range 10 ≤ k ≤ 14 for k and varying t from 50% to 80% in increments of 10. Let {x 1 , x 2 , …, x k } be the k-nearest neighbor entries in increasing order of distance and R be a uniform random number in (0,1). Then the imputed value Diagnosis and Validation of Breast Cancer Subtypes y is given by y = x j , which satisfi es Twenty datasets were generated in this way, one for each (k,t) value. The clustering was averaged over these twenty datasets in order to create a set of clusters insensitive to parameter choice in data imputation. This averaging is an improvement over the kNN method because it is stable to both variation in k and variation in how the neighbors are chosen (as measured by t). Multiple clones in the data were eliminated by averaging after discarding outliers outside a 95% confi dence interval. This process left 523 genes with no missing entries or clones. The fi nal data is given in Supplementary  Table 1.

Identifying "Core" Clusters
We use the letters A, B, C, D, E to denote the fi ve phenotypes: Luminal A, Luminal B, ERBB2+, Basal, and Normal respectively. The clusters were identifi ed using the consensus hierarchical clustering technique of Monti et al. 2003 implemented in GenePattern (http://www.broad.mit.edu/cancer/ software/genepattern/). This method assesses the stability of hierarchical clustering across multiple perturbations of the data. We generated 100 copies of the dataset by randomly selecting 80% of the samples. Each copy was hierarchically clustered using a Euclidean distance metric and the top 5 clusters were selected. For each distinct sample pair (i, j) in the data, we computed the frequency F ij with which the pair clustered together over the 100 copies of the datasets. The matrix of F ij values is called the "agreement matrix." Repeating this for all 20 data imputations and averaging gave the fi nal "consensus agreement matrix" which is shown in Supplementary Table 2. The five core clusters were identified as bicliques (Alexe et al. 2004) using the agreement matrix entries as a measure of similarity. We used the criterion that two samples have the same phenotype and belong to the same core cluster if they have a consensus agreement matrix score greater than P. For the Luminal A and Basal subtypes, the value P = 90% was suffi cient to get an exact match between the core cluster identifi ed by us and the assignment in Perou et al. 2000 andSorlie et al. 2003. However, for samples assigned to Luminal B and ERBB2+ by the earlier study, these thresholds needed to be lowered to 50% and 25% respectively to get agreement with the previous assignments, suggesting that these categories are considerably less stable to data perturbation. The fi ve core clusters contained 60 out of the 118 samples.
From the F ij values, we defi ne the average agreement score between a sample i and other samples j in a given cluster C as where j = 1, ... , n, and n is the number of samples in the cluster C. F i,C was calculated for each of the fi ve clusters. The results are shown in Figures 1 a-e. For each phenotype, we used a cutoff criterion on F i,C to assign it to the corresponding core cluster and these samples are shown in color. Many samples earlier identifi ed as Luminal B also have a high score in our Basal core cluster (Figure 1b and 1d). This suggests that the Luminal B identifi cation is problematic. Figure 1e also shows that some samples identifi ed earlier as Luminal A are placed in our "Normal" core cluster, suggesting that these patients may have minimal disease. Overall, our analysis shows that Luminal A, Basal and Normal phenotypes are robustly classifi able into homogeneous clusters but Luminal B and ERBB2+ do not cluster well. We fi nd that 36% of the samples previously placed in the Luminal B category and 55% of samples previously classifi ed as ERBB2+ are in fact ambiguous; i.e., their assignments are highly sensitive to data perturbation and they should be reanalyzed or classifi ed as ambiguous. The scores of some unclassifi ed samples in Sorlie et al. 2003 are shown in Figure 1f. For the samples where these scores are higher than the cutoff in one core cluster but not in any other, the corresponding sample can be assigned a category label by our clustering.

Bhanot et al
The agreement fraction between the original assignment and our assignments is highest for the Normal, Luminal A and Basal categories and lower in the other two phenotypes. For each sample i in a core cluster, we also calculated the silhouette score (Rousseeuw,1987) defi ned by where a(i) is the average dissimilarity between i and all other samples in the cluster, and b(i) is the minimum average dissimilarity of i to all samples in other clusters. If s(i) values in a cluster are close to unity, the cluster is well defi ned. An s(i) value near zero indicates that the sample is between two clusters. Negative values of s(i) mean that the sample is in the wrong cluster. The "silhouette width" of a cluster is the average of the s(i) scores of all samples in that cluster. The silhouette widths for our core clusters as well as for the Sorlie et al. clusters are given in Table 1. The low values of the average silhouette scores are worrisome. They suggest either that the stratifi cation into these phenotypes is problematic or that a better choices of genes is necessary to separate the phenotypes more reliably.

Identifying Robust Gene Markers
Microarray datasets suffer from an overabundance of genes, most of which do not contribute to the signal. Identifying differentially expressed genes for a given set of phenotypes is a diffi cult problem for which many methods have been proposed. These can be divided into two major groups (Guyon and Ellisseeff, 2003, Inza et al. 2004, Lai et al. 2006, Jeffery et al. 2006) for supervised learning: (i) Filtering or Variable Ranking methods: These select features based on quality scores. They include the fold change test (e.g. Mutch et al. 2002;Breitling and Herzyk, 2005), the t-test (Gossett, 1908, Tusher et al. 2001, the Wilcoxon-Mann-Whitney test (Bradley, 1968;Lehman, 1975), the Signal-to-Noise Ratio (SNR) test (Golub et al. 1999), the J5 test (Patel and Lyons-Weiler, 2004), the D1 test (Patel and Lyons-Weiler, 2004) etc. Another set of methods measure the "separability" of data into different phenotype classes. These include simple separability (Patel and Sorlie et al. 2003. The numbers of samples assigned to each phenotype by the original classifi cation and by our clustering are shown in columns 5 and 6. We see that a larger fraction of assignments into the phenotypes Normal, Luminal A and Basal are correct. The silhouette scores are given in columns 7 and 8. Sorlie et al. 2003 This study core cluster Sorlie et al. 2003 This study core cluster Sorlie et al. 2003 This study core cluster  -Weiler, 2004), weighted separability (Patel and Lyons-Weiler, 2004), envelope eccentricity (Alexe et al. 2006), separation measure (Alexe et al. 2006b) etc. A third class uses informationtheoretic methods such as the entropy criterion (e.g. Furlanello et al. 2003;Liu et al. 2005), mutual information (e.g. Tourassi et al. 2001), information gain (Liu, 2004) etc. Finally, there are the statistical impurity measures (Su et al. 2003) which include the two-ing rule, the Gini index, max-minority, sum-minority, sum-of-variances etc.

Lyons
(ii) Feature Subset Selection Methods: One such method selects those features which are useful for classifi cation for a given machine learning algorithm (e.g. SVM (Vapnik, 1998), ANN (Bishop, 1995), kNN (Ripley, 1996) etc). More sophisticated approaches are embedded methods which include the selection of features as part of the training process for the classifi er. These methods are computationally intensive and require effi cient search strategies or a preliminary fi ltering of the non-reliable genes to reduce the dimensionality of the problem.
The existence of such a variety of feature selection methods poses a challenge in microarray data analysis. There have been recent attempts to combine various approaches into a meta selection procedure based on "majority-voting" using ranking by predictive content across many data perturbations and machine learning methods (e.g. Bhanot et al. 2005;Alexe et al. 2005a). Several studies (Guyon and Ellisseeff, 2001;Alexe et al. 2005b) have shown that variables which are only weakly correlated with phenotype are very useful when used in combinations. This principle has lead to the development and study of combinatorial markers or patterns (Crama et al. 1988;Bhanot et al. 2005;Alexe et al. 2006b).
In the present study, we have chosen to use a single feature selection method (namely the SNR test, Golub et al. 1999) which has been shown (Alexe et al. 2006b) to have good performance on genomic and proteomic data. However, we cannot guarantee that it is the best method, particularly because of the need to impute the missing data in the dataset of Sorlie et al. As an added check on the feature selection, we also use the combinatorial "pattern" method and averaging over data perturbations to reduce the errors from potentially "less than optimum" choice of features.
We identifi ed a large pool of uni-gene markers for each core that distinguish it from the others using the signal-to-noise statistic. For gene i, if μ 1 (i) and μ 2 (i) be the average gene expression levels for the core and its complement and σ 1 (i) and σ 2 (i) the corresponding standard deviations, the signal-to-noise ratio (SNR) is defi ned as SNR = (μ 0 -μ 1 )/(σ 0 + σ 1 ). The t-test statistic is the same as the SNR except that the denominator is (σ 0 2 + σ 1 2 ) 1/2 . Since (σ 0 + σ 1 ) > (σ 0 2 + σ 1 2 ) 1/2 SNR is a more conservative criterion than the t-test.
The SNR statistic is preferred over the t-test in situations when the sample size in a class is small (less than 30) because it does not assume a Gaussian distribution for the underlying variables; an assumption which is implicit in the t-test. When combined with a permutation test for measuring p-values, the SNR statistic is a powerful and widely used technique for feature selection and class discrimination (e.g. Golub et al. 1999;Ramaswamy et al. 2001;Shipp et al. 2002;Sun et al. 2004;Goh and Kasabov 2005;Monti et al. 2005) and is implemented in several software packages (e.g. GenePattern and Gene Set Enrichment Analysis (GSEA), http:// www.broad.mit.edu/tools/software.html).
The signal-to-noise (SNR) was computed for each gene for each of the 20 imputed datasets and for each of the 60 leave-one-out sample perturbation experiments for the core samples. The selected genes were those whose p-value for the SNR was below 0.01 and the signifi cance of the SNR for false discovery rate (FDR)  was above 0.95 in each experiment.
This procedure identifi ed 391 robust uni-gene markers (given in Supplementary Table 3) for the fi ve core clusters. They consisted of overlapping sets of genes, 238 for Luminal A, 234 for Basal, 66 genes for Luminal B, 35 genes for ERBB2+ and 118 genes for Normals. These included many genes identifi ed in previous studies (Perou et al. 2000;Sorlie et al. 2003;Loi et al. 2005). For example, the Luminal A set included the known estrogen pathway genes (ESR1, LIV1, GATA-3) and the Basal set the known genes CCNE1, LAD1, and KRT5.
We further reduced this pool to 148 genes using the more stringent criteria which used the significance of the SNR for several metrics: the false discovery rate, the Q value , FWER (Dudoit et al. 2002), Bonferroni correction . More details about the multiple testing metrics we used are given in Supplementary Information I. These 148 genes included 79 genes for Luminal A and 60 for Basal with an overlap of 31 genes. The other phenotypes Table 2a. Collection of uni-gene markers for the Luminal A phenotype. The markers are sorted in decreasing order with respect to to the signal-to-noise ratio.

Patterns (Multi-gene Markers) for the Core Clusters
The complexity of BCA makes it unlikely that single genes can predict phenotype. Instead, one expects combinations of genes to be better at identifying phenotype. Consequently, we used " patterns" (as defi ned in Crama et al. 1988;Alexe and Hammer, 2005;Bhanot et al. 2005) to distinguish the core clusters. A pattern is a set of linear constraints on the expression levels of a group of genes satisfi ed by many samples in a particular cluster and by few samples in other clusters. For example, the pattern P A below is satisfi ed by all samples in the "Luminal A" cluster and by none of the non-Luminal A samples: For illustration, Figure 3 shows two patterns P A and N A , in the 2-d expression plane for GATA3 and Liv-1. A pattern is characterized by its degree, prevalence, and homogeneity. The degree is the number of genes appearing in its defi ning conditions. The prevalence of a pattern is the percent of positive (negative) cases which satisfy the pattern. The homogeneity of a pattern is the percentage of positive (negative) cases covered by it. In general, patterns useful for classifi cation have low degree and high prevalence and homogeneity.
We identified all patterns for the 60 core samples over the selected 148 genes by applying the combinatorial algorithm described in (Alexe and Hammer, 2005). Briefl y, each sample from a core cluster was placed in a box by defi ning cuts in gene expression space which distinguish it from the samples belonging to other core clusters. The boxes were then merged by extending them along Table 2e. Collection of uni-gene markers for the Normal phenotype. The markers are sorted in decreasing order with respect to to the signal-to-noise ratio.

Gene index
Gene Description GeneBank Acc SNR FDR   Figure 3. An example of a pattern (pattern P A ) characteristic of the Luminal A core cluster (Cluster A) and an example of a pattern (pattern N A ) characteristic of the non-Luminal A cases. Notice that P is satisfi ed by all the samples in the Luminal A group, while N is satisfi ed by 88% of the non-Luminal A cases. Both patterns P and N are expressed as bounding constraints on the expressions of genes Liv-1 and Gata-3. all possible dimensions without allowing any member of the opposite class to be included in the box. The maximal boxes so obtained defi ned the patterns.

Bhanot et al
The pattern parameters (degree, prevalence, and homogeneity) were determined by estimating the classifi cation accuracy of a weighted-voting model constructed on pattern data through 10-fold crossvalidation experiments. Pattern-based weighted voting is a meta-classifi cation scheme in which individual patterns are "voters" for a phenotype. The performance of a multi-pattern meta-classifi cation system is better than the performance of single patterns if the patterns are uncorrelated (Merz, 1998). Uncorrelated patterns were selected by requiring the patterns to be defi ned on non-overlapping subsets of features. To avoid over-fi tting, the patterns were required to use no more than fi ve genes each.
We found many patterns of degree 2 and 3 for each phenotype, each of which was common to more than 90% of the samples in the cores. Table 3 presents some of these patterns. The striking feature of Table 3 is that simple conditions on a few genes are able to generate a very clean classifi cation in the cores. Several genes occurred frequently in the patterns, suggesting an active association with disease. For example, KIAA1691, PREP, CX3CL1, LIV-1, PLOD, GATA-3 occur in 20% of patterns for Luminal A, while PRAME, PLAT, CCNE1, FKHL7, clone MGC:22588 IMAGE:4696566, occur in 15% of the patterns for the Basal group. There are also several genes which are good uni-gene markers but are not found in patterns.

Consistency of Core Assignments Using Either Patterns or Clustering
A positive pattern is a set of conditions satisfi ed by a sample that belongs to a core cluster. A negative pattern is a set of conditions satisfi ed by a sample that belongs to the complement of the core cluster. For each unlabeled sample we counted the number of positive minus the number of negative patterns satisfi ed by it for each core cluster. The sample was assigned to the core cluster for which the ratio obtained by dividing this number to the total number of patterns for the core cluster, was positive and maximum. If the maximum ratio was negative or if it was assigned to multiple core clusters then the sample remained unclassifi ed (Alexe et al. 2005c). The classifi cation of samples to cores was validated using leave-one-out experiments on patterns. Over the sixty samples in the cores, in each such experiment, the entire procedure (gene selection, pattern extraction and sample classification) was repeated sixty times, once for each omitted sample.
A comparison of our clustering and pattern assignments with the original classification is presented in Table 4. The color scheme is that if the sample is robustly assigned to a phenotype, its entry is the color of that phenotype. Samples whose classifi cation is either poor or ambiguous are in black or left blank respectively. When the Bhanot et al       Table 5. Phenotype prediction for previously unassigned breast cancer samples. pattern and cluster classifi ers agree, the assignment can be considered accurate. When they differ, no classifi cation is possible. From a treatment perspective, the recommendation of such an inconclusive assignment would be retesting. The clustering and patterns classifi ers for the unassigned samples in the Sorlie et al. paper are shown in Table 5. Some of these originally unassigned samples are assigned to a consistent phenotype by our methods. Table 6 summarizes the sensitivity and specifi city of the pattern based classifi er showing once again the robustness of the classification into phenotypes Normal, Luminal A and Basal and the unreliability of the other two phenotype classifi cations.

Validation on an External Dataset Data 2
We used the markers identifi ed in Data 1 to classify samples in Data 2. These two datasets had 93 genes in common. Of these, 79 were in our 391 uni-gene set and a subset of 38 of these were in the smaller subset of 148 genes. Of the latter, 23 were markers for Luminal A, 4 were markers for Luminal B, 3 were markers for ERBB2+ and 12 were markers for the Basal group. For each of the 38 genes, we normalized the data sets relative to each other by equating the average intensity of each gene for the normal samples in the two data sets. In each dataset, the expression level of each gene was replaced with its quartile value across all samples. We recomputed a pattern-based classifi er trained on the known core clusters in the Sorlie et al. (2003) data and used it to predict the phenotype for Ma et al. 2003 samples. Figure 4 shows a heat map of the 38 genes in common between the datasets. This plot includes all core samples from Data 1 and all samples from Data 2. The Normal samples from both sets cluster nicely showing that the global normalization was done correctly. The Luminal A cluster is easily identifi ed because all Luminal A core samples from Data 1 cluster together with several samples from Data 2. There is also a distinct Basal cluster with most Data 1 Basal samples and a few Data 2 samples on its edges. Finally, there is another cluster with some Core B samples which looks quite similar to Luminal A. The core C samples are mixed in with the Basal cluster (as was already noticed in Figure 1c). We conclude that it is not possible to assign Luminal B or ERBB2+ phenotypes to samples in Data 2 based on Data 1 because a) There are very few genes in these categories (3/38 for ERBB2+ and 4/38 for Luminal B), b) the ERBB2 gene is missing in Data 2 and c) The quality of the patterns using the 38 genes for these two phenotypes is poor. Indeed, for core C, there are no patterns at all and for core B, the patterns are of poor statistical quality.
To further validate the consistency of our assignments, we trained a pattern-based classifi cation model on quartile discretized Data 1 samples and used it to predict the phenotype for the samples in Data 2 using majority voting. When the prediction from patterns agreed with the prediction from clustering as in Figure 4, we felt confi dent of the diagnosis, otherwise not. Our predicted phenotypes for the Ma et al. data are given in Table 7.

Pathways for each Core
To identify processes/pathways that are common and particular to the different phenotypes, we used the bioinformatics public resources DAVID (Dennis et al. 2003), BioRag (Pandey et al. 2004), Table 7. Predicted phenotype for samples in Ma et al. data using patterns from core clusters in Sorlie et al. 2003. We are confi dent of the phenotype assignment for those samples marked in color in columns 9 and 10 .  y  a  w  h  t  a  p  r  o  e  p  y  t  r  e  c  n  a  c  d  e  t  a  l  e  R  y  a  w  h  t  a  P  k  n  a  B  e  n  e  G  n  o  i  t  p  i  r  c  s  e  d  e  n  e  G  p  u  o  r  G   d  e  t  a  l  e  r  r  e  c  n  a  c  t  s  a  e  r  B  s  r  o  t  p  e  c  e  R  _  r  a  e  l  c  u  N  2  0  7  1  9  2  A  A  1  r  o  t  p  e  c  e  r  n  e  g  o  r  t  s  e  1  R  S  E KIT v-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog

N20798
Regulation of BAD phosphorylation Breast cancer related, Loss of c-kit expression has been reported in 80-90% of breast cancer specimens, suggesting a possible role in the development of tumors. Introduction of the c-kit gene leads to growth suppression of a breast cancer cell line, MCF-7 (Nishida et al., 1996) (Takahashi et al., 2002) and endometrial adenocarcinomas (Maatta et al., 2004).   Tables 2a -d). Table 9 presents the GO categories enriched for the genes associated with the cores. The statistical signifi cance of the enriched GO categories is computed as described in Supplementary Information II. The complete list of gene markers for the core phenotypes involved (continued) in the enriched GO categories is available in Supplementary Table 4. Whereas we discuss markers for each core subtype, we have strong confi dence only in the markers for Luminal A and Basal.
In Luminal A, ESR1 is up-regulated, indicating that the estrogen receptor pathway is turned on.
The KIT gene was already known to be lost in breast cancer. Introduction of the c-kit gene leads to growth suppression of a breast cancer cell line, MCF-7 (Nishida et al. 1996). The Neuregulin 1 gene, which is up-regulated, is a direct ligand for ERBB3 and ERBB4, and an indirect activator of ERBB2, though the ERBB2+ subtype is identifi ed with Cluster C. The nuclease sensitive element binding protein (NSEP1), which is also up-regulated, is known to inhibit p53 induced apoptosis (Zhang et al. 2003). It has also been recently shown to be a target of Akt phosphorylation, and that disruption of phosphorylation inhibits tumor growth (Sutherland et al. 2005). This gene is involved in D4-GDI signaling pathway, which may also be up-regulated.
A number of Luminal A markers were previously identifi ed cancer related genes. The ID4 gene, which was also reported to be down-regulated in gastric adenocarcinoma and leukemia, may cause the alteration of the TGF-beta signaling pathway which regulates the growth and proliferation of cells, blocking the growth of many different cell types. The TGF-beta receptor includes Type I and Type II subunits that are serinethreonine kinases that signal through the Smad family of proteins. Another cancer related gene is GSTP1, which was reported to be lost in different types of cancers including prostate cancer, lung cancer and squamous cell carcinoma. Other cancer related genes include the TFF3 gene, which was shown to activate STAT3, (an oncogene) signaling in human colonic cancers (Rivat et al. 2005) and the VEGF receptor FLT1 gene.
Biomarkers for Cluster B (Luminal B) include fibroblast growth factor FGFR4 which might be from the fact that this family of genes is known to be overexpressed in cancers of the cervix and bladder, though their role in breast cancers is more controversial (Streit et al. 2004;Jezequel et al. 2004); two cancer related genes: Gamma-glutamyl hydrolase (GGH) gene, which was also identified as a biomarker for pulmonary neuroendocrine tumors (He et al. 2004), and laminin, gamma 2 (LAMC2) gene, which was reported to be involved in tumor invasion and metastases in pancreatic ductal adenocarcinoma (Takahashi et al. 2002) and endometrial adenocarcinomas (Maatta et al. 2004). The latter gene is down-regulated in the breast cancer data sets analyzed here.
Generally, Cluster C (ERBB2+ subtype) biomarkers appear to be mostly receptors, receptor binding proteins and signal transduction related proteins (Table 9). As expected, the most characteristic of these genes is the up-regulated ERBB2 gene. Other important genes include two breast cancer related genes, namely, the F2R gene, a matrix metalloprotease-1 receptor that promotes invasion and tumorigenesis of breast cancer cells (Boire et al. 2005); and PPAR binding protein, coactivator of ESR1 and overexpressed in breast cancer (Zhu et al. 1999). The down-regulation of FLNB fi lamin B alters the MAP Kinase pathway with implications in both growth control and development.
The marker genes for the Basal phenotype (Cluster D) are significantly involved in cell cycle, regulation of cell proliferation, endoplasmic reticulum as well as in various metabolic processes. Important cancer related genes identifi ed for this phenotype are CDK6 gene, which inhibits proliferation of human mammary epithelial cells (Lucas et al. 2004); SIAT4C, which is down-regulated in RCC (Saito et al. 2002), RHOB, which is known to be a pro-apoptotic and tumor suppressor gene, and the FLT1 and TFF3 gene. Plasminogen activator gene (PLAT) is involved in tissue remodeling while fi bromodulin (FMOD) gene has a primary role in collagen fi brillogenesis.
The last of the clusters is the control or normal group. Here we fi nd that the genes identifi ed as signifi cant markers are involved in organelle organization and biogenesis, cytoskeleton organization and biogenesis, or in metabolic pathways (e.g. cofactor biosynthesis). These represent genes that are pathologically expressed in all tumor strata; consequently they are able to robustly stratify BCA samples from control (Normals).
Overall, the biomarkers notably constitute genes that participate in breast cancer related pathways (e.g. marker genes involved in estrogen receptor pathway) and genes that were previously implicated in other cancer types (e.g. GSTP1, FLT1, see Table 8). Moreover, the enriched categories in each phenotype are biologically plausible, having already been implicated in cancer transformation (e.g. cell cycle, cell motility, cytoskeleton organization) (Hanahan and Weinberg, 2000) or being potentially important in transformation (signal transduction pathways, metabolism).

Summary and Discussion
We have presented a robust clustering and pattern based analysis of the phenotypes identified by Sorlie et al. 2003. We fi nd that the clusters for Luminal A, Basal and Normal subtypes are homog-enous and have predictive content. However, the Luminal B and ERBB2+ assignments are sensitive to data perturbations. One reason for this is that the genes chosen for the classifi cation are too few and not appropriate for these two categories. This is evidenced by the fact that the number of genes for Luminal B and ERBB2+ that pass our stringent robustness fi lters is small. Another reason is that hierarchical clustering is inappropriate to resolve the subtleties of the Luminal B and ERBB2+ categories. Finally, these subtypes are more heterogeneous than Luminal A and Basal and possibly have further substructure not classifi able with the genes in this dataset. A larger number of samples and better/more genes are necessary to test these conclusions.
Several samples previously unclassified in Sorlie et al. 2003 were classifi able by our techniques. We also found several samples which show a complex (multiple) phenotype signature. Given the treatment implications, the patients from whom these samples were taken should undergo further analysis or different treatment.
We also describe a general method to deal with sensitivity to noise in gene array data, which often confounds the analysis. There are four principal sources of noise. The fi rst, which we cannot do anything about, is the experiment itself: a) different samples handled differently in and experiment or between different labs; b) data improperly collected or improperly recorded/measured; c) microarray or cDNA readout with missing or unreliable entries. The second type of "noise" is stochastic noise; from statistical errors in the measurement of the signal or from normal variation within a phenotype in the sample population. We show how to partially account for this noise by data perturbations and consensus analysis. A third source of noise is the data analysis methods used. In particular, there are many different defi nitions of distance between gene expression vectors and many different clustering techniques. These often lead to different clusters depending on parameter choices, and to clusters that are unstable to perturbations. Our method robustly deals with this issue to get reliable predictions. A fourth source of noise derives from the genes selected as the basis for the analysis (Ein-Dor et al. 2005). This set results both from the initial choice of genes on the chip and the subset of genes that is used in the clustering. The choice of genes on chips will improve only if chip manufactures come up with better chips, possibly motivated by the biology of the underlying processes. However, given a gene set, this paper describes a procedure to select a data perturbation independent and predictive subset of the genes.
The fundamental requirement of any clustering analysis is the assignment of confi dence levels to clusters. This is particularly important in gene expression analysis where a small sample set is clustered using a large set of noisy genes which makes the clustering results sensitive to noise and susceptible to over-fi tting. Our methods use re-sampling and cross validation to simulate perturbations of the data, and this allows us assess the stability of the clustering with respect to sample variability.
In functional genomics, agglomerative hierarchical clustering (HC) has been widely adopted as the unsupervised analysis tool of choice, mainly because of its intuitive appeal and its visualization properties. By not committing to a specifi c number of clusters, HC provides for a multi-resolution view of the data that can be extremely useful in exploratory data analysis. However, the method does not provide for an "objective" criterion to establish the number of clusters and the clusters' boundaries. Furthermore, the resulting trees are known to be highly unstable to small perturbations of the data. The trees also tend to preserve sample joining errors made at earlier stages.
To correct for these problems, we recommend averaging over perturbations of the original data. The hierarchical clustering algorithm can then be applied to each of the perturbed data sets, and the agreement, or consensus, among the multiple runs can be assessed. This technique will measure the "stability" of the discovered clusters to sampling variability. The basic assumption of the method is intuitively simple: if the data represent a sample of items drawn from distinct sub-populations, and if we were to observe a different sample drawn from the same sub populations, the induced cluster composition and number should not be radically different. Therefore, the more the attained clusters are robust to sampling variability, the more confident we can be that these clusters represent real structure. Overall, the procedures suggested here will be of use in examining any data in a way that makes the predictions insensitive to stochastic and systematic variation.
A frequent concern in gene-array data and analysis is whether the data is reproducible, and whether the inferences are consistent with current biological knowledge. In this paper we address the fi rst issue by applying the results of our analysis on one data set to make predictions on another. For the phenotypes which cluster well, we can make defi nite predictions on the unseen data. In addition, we identify pathways via genes whose markers are predictive of phenotype. It is likely that these genes have only diagnostic value, i.e. they are downstream effects of an established disease process whose cause is outside the identifi ed set of genes. This is a problem with most microarray data which is usually available only for cells which show established disease.

Supplementary Information I: Multiple Testing Correction Metrics
The general multiple hypothesis testing analysis used in our paper results in the following matrix: We use the following statistics to analyze this table.
False discovery rate (FDR). The FDR  is the expected proportion of Type I errors among the rejected hypotheses: FDR = E(Q); with Q = V/R if R > 0 and Q = 0; if R = 0.
The q-value of a gene  is defi ned as the minimal FDR at which it appears signifi cant.
Family-wise error rate (FWER, . The FWER is defi ned as the probability of at least one Type I error (false positive): FWER = Pr(V > 0) The Bonferroni correction ) : Suppose we conduct a hypothesis test for each gene g = 1,…,N, producing an observed test statistic: T g , an unadjusted p-value: p g . = the probability under the null hypothesis that the test statistic is at least as extreme as T g . Under the null hypothesis, Pr(p g < a ) = a. Bonferroni adjusted p-values: p g = min (1, N p g .)